New Lexical Entries for Unknown Words
* Heinrich-Heine-Universität Düsseldorf, Germany. The presented work was carried out in the project "SIMLEX - Simulation lexikalischen Erwerbs" (simulation of lexical acquisition) funded by the Deutsche Forschungsgemeinschaft. For helpful comments and discussions on the topic we would like to thank Gosse Bouma, Lynne Cahill, Roger Evans, Gerald Gazdar, and Dafydd Gibbon.
Abstract
The following paper presents an approach for simulating the acquisition of new lexical entries for unknown words, an issue that is central to natural language processing since no lexicon can ever be complete. Acquisition involves two main tasks. First, the appropriate information about an unknown word in a given linguistic context (i.e. sentence) is identified. It is shown that this task requires new general considerations about shared information in unification-based representations. Second, the collected information is formulated in a new lexical entry according to a comprehensive theory of the lexicon which defines the form of lexical entries and the relations between them. This task is solved by a general algorithm that depends only on the form of the collected information and is independent of its content, i.e. it treats all unknown words in the same way.

1. General issues concerning unknown words

The lexicon of any natural language must be viewed as an open system in which items are constantly added, modified, and deleted. Various investigators in natural-language processing (NLP) (cf. the contributions in Zernik 89) have noticed this property but treated it basically as a shortcoming of language that needs to be overcome. In contrast, we view this property as an inherent and essential aspect of natural language. Consequently, the lexicon of an NLP system should reflect this characteristic, so that the model itself encompasses incompleteness. Thus, when a sentence is parsed, the problem may arise that no adequate entries for certain words can be found in the given lexicon, but the system must nevertheless be able to parse such a sentence. Better yet, it should utilize the sentence as a source of new information in order to extend the lexicon.

There are various reasons why an adequate entry for a given word in an analyzed sentence may not be found. The most obvious one is that no entry exists for the word. In this case the system must check whether the word is phonetically and morphologically admissible in the given context. If it is admissible and not to be interpreted as an error (e.g. a misspelling), then the form is a "new" or "unknown" word. Even if a lexical entry exists, it may not be appropriate for this context, since one word form may be associated with different properties, not all of which are recorded in the lexicon. The syntactic category is such a property. The German word form weichen, for example, is a verb ('withdraw') as well as a noun ('shunts') and an adjective ('soft'). Here the words are not related semantically, whereas for the German word form essen (verb 'eat' or noun 'meal') a semantic relation exists. Even if the category of the lexical entry is correct for the context, other properties may be inappropriate. For example, one verb may have different subcategorization frames, some of which are missing in the lexical entry. In our approach all admissible words for which no suitable lexical entry is found are viewed as unknown words.

Many parsers simply fail upon encountering new words, so a standard parsing algorithm must first be adapted to deal with them at all. In particular, the context of a new word must be used to assign it information which, if possible, allows the sentence containing it to be parsed; in some cases no such assignment of information to the new word allows a successful analysis.
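To make this step concrete, the following is a minimal sketch under our own assumptions (the feature names, the toy lexicon, and the unknown form Brimborium are invented for illustration and are not taken from the paper): an unknown word is looked up as a maximally underspecified entry, and the constraints the parser derives from the context are unified into it.

    # Sketch (not the paper's implementation): an unknown word receives an
    # empty, maximally underspecified feature structure that is then narrowed
    # by unification with constraints derived from the sentence context.

    def unify(fs1, fs2):
        """Unify two flat feature structures (dicts of atomic values)."""
        result = dict(fs1)
        for feat, val in fs2.items():
            if feat in result and result[feat] != val:
                return None            # feature clash: unification fails
            result[feat] = val
        return result

    LEXICON = {
        "essen": {"cat": "v", "subject-case": "nominative"},
    }

    def lookup(word):
        # Known words return their stored entry; unknown words return an
        # empty structure instead of causing the parse to fail outright.
        return LEXICON.get(word, {})

    # Hypothetical constraints a parser might derive for an unknown word
    # that appears after a determiner in object position.
    context_constraints = {"cat": "n", "case": "accusative"}

    collected = unify(lookup("Brimborium"), context_constraints)
    print(collected)   # {'cat': 'n', 'case': 'accusative'}

The result of such a unification is exactly the kind of collected information from which, in the second task, a new lexical entry would be formulated.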
After successful parsing there are two possibilities for the further treatment of such unknown words. Either no further actions are carried out and the lexicon remains unchanged, or information about the unknown words is collected and used to formulate new lexical entries. In this case new lexemes are acquired and the lexicon is changed accordingly. It should be clear that the first possibility is unsatisfactory, since the model then fails to capture one of the main properties of lexicons. (At this point 'lexicon' and 'lexical entry' are pretheoretical concepts and do not presuppose any particular linguistic theory; a theoretical characterization of the lexicon is developed in section 2.)

The acquisition of new lexemes makes it necessary to update the lexicon in such a way that the new lexemes are integrated into the existing lexical structure. The process of integration differs depending on the respect in which the word is unknown. Words for which an entry, though inappropriate, exists require a different treatment than those for which no entry is found. In the former case the existing entry must be revised according to the new information. In the latter case a new lexical entry must be created; this constitutes the main issue of the paper. Here, the structure of lexical entries as well as the overall structure of the lexicon into which the new entry has to be integrated are crucial.

2. Structure of the lexicon and the lexical entries

In an NLP system lexical entries must have a consistent and well-defined form. Within unification-based frameworks (cf. Shieber 86) they can be defined with templates or directly with path equations. The established structure holds not only for existing entries but also for new lexical entries, since the latter have to be integrated into the current lexicon. Furthermore, this structure and the overall structure of the lexicon influence the way in which new lexical entries are formulated. So, first of all, the structure of the lexicon and of the lexical entries has to be specified.

An adequate representation for a lexicon must meet certain conditions. First of all, general requirements like computational tractability, user friendliness, and reduction of redundancy have to be met. Lexical entries defined directly with path equations, for example, are computationally tractable, but they are highly redundant, not easily readable for human users, and therefore unsuitable. Templates, on the other hand, fulfill these general requirements but fail to model the following aspects typical of the lexicon. Characteristically, lexical entries are not isolated items but rather are related to other linguistic objects in a hierarchy (cf. the pioneering work in Flickinger/Pollard/Wasow 85). In this hierarchy various relationships hold between the items, including not only regularities but also subregularities and exceptions that cannot be captured by templates. (The representation of these relationships requires nonmonotonic devices which are not included in standard unification formalisms; a well-defined nonmonotonic extension is presented in Bouma 90.) Figure (1) illustrates these kinds of relationships in a lexicon.

[Figure (1): fragment of a lexical hierarchy. The type verb (cat: v, subject-case: nominative, subject-status: normal) has the subtypes intransitive and transitive; transitive adds object-case: accusative.]
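As a minimal sketch of how such a hierarchy with defaults might be modelled (our own illustration, not the formalism of Bouma 90 and not the paper's representation), a subtype can inherit all features of its supertype and override individual values, so that an entry only needs to record its type and its idiosyncratic features.

    # Sketch of the hierarchy in figure (1) with default inheritance:
    # a subtype inherits the features of its supertype and may override
    # individual values. (Illustrative only.)

    HIERARCHY = {
        # type: (supertype, locally defined features)
        "verb":         (None,   {"cat": "v",
                                  "subject-case": "nominative",
                                  "subject-status": "normal"}),
        "intransitive": ("verb", {}),
        "transitive":   ("verb", {"object-case": "accusative"}),
    }

    def expand(lex_type):
        """Collect the full feature set of a type by walking up the
        hierarchy; features of more specific types take precedence over
        inherited defaults."""
        features = {}
        while lex_type is not None:
            supertype, local = HIERARCHY[lex_type]
            for feat, val in local.items():
                features.setdefault(feat, val)
            lex_type = supertype
        return features

    # A new lexical entry then records only its type and idiosyncratic
    # information; everything else follows from the hierarchy.
    new_entry = {"form": "essen", "type": "transitive"}
    print({"form": new_entry["form"], **expand(new_entry["type"])})
    # -> {'form': 'essen', 'object-case': 'accusative', 'cat': 'v',
    #     'subject-case': 'nominative', 'subject-status': 'normal'}

Defined directly with path equations, the same entry would have to repeat every feature path explicitly, which is exactly the redundancy criticized above; a template amounts to naming such a feature bundle, but without the ability to override inherited values that subregularities and exceptions require.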